ABSTRACT

A high order recursive neural network is constructed to learn the task
of stereopsis from random dot stereograms. The connection weights of the
network are learned through the Hebbian rule. To avoid an overwhelmingly
large number of weights, we exploit the translational symmetry: we train
only a small local patch and then transport the learned weights uniformly
to the whole network. Since Hebbian learning is linear, the weights can be
calculated analytically. The results show that the continuity and the
uniqueness constraints first proposed by Marr and Poggio are learned
automatically.


INTRODUCTION 

Neural network models have been demonstrated to be very effective in
computing perceptual problems such as vision, speech, and motor control.¹
One of their main advantages is the ability to learn automatically to
perform a specific task by way of a learning algorithm.² In a single-layered
perceptron, the error correction learning rule is guaranteed to converge if
a solution exists. In a multi-layered feed-forward network, back propagation
of error messages is used to train the network. These learning methods are
generally nonlinear, making it hard to analyze the final network and to
understand its working principle.

In this paper, we study an example, namely the stereopsis of random dot
images,³ wherein a recursive network is used. We demonstrate that the
algorithm for solving random dot stereograms can be learned by the simple
Hebbian rule. Since the Hebbian rule involves neither error correction nor
back propagation and is local and linear, it allows us to calculate the
connection weights analytically. The result confirms the uniqueness and the
continuity constraints that Marr and Poggio⁴ first postulated to be the
working principle of the stereopsis network. Through Hebbian learning, these
two constraints are learned automatically. Furthermore, the learned weights
are symmetrical, which ensures the convergence of the recursive network. The
network we used has five depth layers. Each layer contains 100 x 100 = 10^4
pixels of neuron cells. If fully connected, it would require
(5 x 10^4)^2 = 2.5 x 10^9 connection weights, a formidably large number. To
reduce this number, we exploit the locality and the translational symmetry
of the stereopsis problem. We train a small patch of the network with pixel
size a x a (a < 11) and then transport it to the whole image plane. The
result is a high order recursive network with local connection weights
uniformly distributed in the network.

II. TRAINING OF THE NETWORK

Stereo vision is achieved by detecting the binocular disparity between
the two images observed by the two eyes. The random dot stereogram
demonstrates that stereo perception is an early vision problem that does
not involve the high level task of recognizing and identifying visual
objects. In Fig. 1(a) and (b) we show two random dot images for the left
and the right eyes. Each has 100 x 100 pixels, and each pixel is black with
probability v = 1/2 and white otherwise. The actual three-dimensional
image, a five-layered cake, is shown in Fig. 1(c). Our task is to train a
neural network to construct the three-dimensional image surfaces from the
two monocular random dot images. Marr and Poggio constructed a network
with connection weights designed from two constraints: the uniqueness
constraint, which says that in any given direction we see only one image
surface, and the continuity constraint, which says that a surface is
usually continuous. In this paper we demonstrate that these two working
principles of stereopsis can be learned automatically by the network using
the Hebbian learning rule.

In  order  to  proceed,  we  first  choose  the  following  conventions  for 
notations  and  representations. 

(i) In the random dot images, a black dot is assigned the value +1
and a white dot -1.

(ii) In each monocular image the probability for a dot to be black
is v.

(iii)  The  maximum  depth  is  an  integer  D. 

(iv) The input to the network is a conjunction of the left and the
right monocular images (see Fig. 1(d)),

$$ I_{i,\mathbf{k}} = (RL)_{i,k_x,k_y} = R_{k_x,k_y} \cdot L_{k_x-i+1,\,k_y} , \tag{1} $$

where $\mathbf{k} = (k_x, k_y)$ is the position vector of the pixel point on
a single monocular image and i = 1, 2, ..., D is the index of the depth
level counted from the bottom (i = 1) to the top (i = D). R and L denote the
right and the left monocular images respectively. Equation (1) implies
that the input is composed of all possible matches between the left and the
right monocular images, including both the black and the white dots.
(v) The output of the network also consists of five (D = 5) layers of
planes with size 100 x 100. We let

$$ S_{i,\mathbf{k}} = \begin{cases} s & \text{if } (i,\mathbf{k}) \text{ lies on a solution surface} \\ -1 & \text{otherwise.} \end{cases} $$

Here the solution surface means the outline surface of the 3-D object
viewed from above (as shown in Fig. 1(c)), and s is a normalization factor
such that the total sum of all the components of the connection weight
would be zero if s = D - 1.
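
As a minimal sketch of conventions (iv) and (v), the following Python fragment builds the D match planes of Eq. (1) and the target planes S from two images of +1 and -1 dots. The function names, the wrap-around at the image border, and the reading of the shifted index as k_x - i + 1 are assumptions made for illustration; the text does not specify them.

```python
import random

def random_dot_image(n, v=0.5, rng=None):
    """Return an n x n image of +1 (black, probability v) and -1 (white) dots."""
    rng = rng or random.Random(0)
    return [[1 if rng.random() < v else -1 for _ in range(n)] for _ in range(n)]

def match_planes(right, left, depth):
    """Input planes I[i-1][kx][ky] = R[kx][ky] * L[kx - i + 1][ky], i = 1..depth.

    Assumption: the shifted index wraps around at the image border.
    """
    n = len(right)
    return [[[right[kx][ky] * left[(kx - d) % n][ky] for ky in range(n)]
             for kx in range(n)]
            for d in range(depth)]  # d = i - 1 is the shift of layer i

def target_planes(height, depth):
    """Target S[i-1][kx][ky]: s = depth - 1 on the solution surface, -1 elsewhere."""
    s = depth - 1
    n = len(height)
    return [[[s if height[kx][ky] == i else -1 for ky in range(n)]
             for kx in range(n)]
            for i in range(1, depth + 1)]
```

With s = D - 1, the target values at any pixel sum to zero over the D layers, which is the normalization mentioned in (v).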


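We then use the Hebbian learning rule to construct the weights
between $S_{i,\mathbf{k}}$ and $I_{j,\mathbf{k}'}$:

$$ W_{i,\mathbf{k};\,j,\mathbf{k}'} = \sum_{p} S^{p}_{i,\mathbf{k}}\, I^{p}_{j,\mathbf{k}'} = N_p \left\langle S_{i,\mathbf{k}}\, I_{j,\mathbf{k}'} \right\rangle , \tag{2} $$

where the summation is over all possible pattern pairs of S and I, and
$N_p$ is the total number of patterns.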


For training patterns, we choose a small patch of size a x a with
a << 100 to reduce the number of weights to a manageable size. Each training
pattern is generated by dividing the patch into four frontal planes with two
dividing lines, one horizontal and one vertical. The positions of these two
lines and the heights of the four sub-patches are uniformly random.
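
The training procedure can be sketched numerically as a Monte Carlo estimate of the Hebbian average between the center neuron of one layer and every input neuron in the patch. This is only an illustration under stated assumptions: the function names are hypothetical, the dividing lines are placed between pixels, and a match value off the solution surface is drawn as the product of two independent random dots rather than from a full stereogram.

```python
import random

def sample_heights(a, depth, rng):
    """Heights (1..depth) of an a x a patch cut into four frontal planes."""
    cut_x = rng.randint(1, a - 1)  # vertical dividing line, placed between pixels
    cut_y = rng.randint(1, a - 1)  # horizontal dividing line
    h = [[rng.randint(1, depth) for _ in range(2)] for _ in range(2)]
    return [[h[int(x >= cut_x)][int(y >= cut_y)] for y in range(a)]
            for x in range(a)]

def sample_match(on_surface, v, rng):
    """(RL) value: +1 on the solution surface, else a product of two random dots."""
    if on_surface:
        return 1
    r = 1 if rng.random() < v else -1
    l = 1 if rng.random() < v else -1
    return r * l

def hebbian_weights(a=5, depth=5, v=0.5, n_patterns=4000, seed=0, layer=1):
    """Estimate <S_{i,k} I_{j,k+r}> for the center pixel k and i = layer.

    Returns a dict mapping (j, x, y) to the averaged Hebbian product,
    where (x, y) is the offset r from the patch center.
    """
    rng = random.Random(seed)
    s = depth - 1  # normalization chosen in the text
    c = a // 2     # center pixel (a is assumed odd)
    sums = {(j, x - c, y - c): 0.0
            for j in range(1, depth + 1) for x in range(a) for y in range(a)}
    for _ in range(n_patterns):
        heights = sample_heights(a, depth, rng)
        target = s if heights[c][c] == layer else -1
        for j in range(1, depth + 1):
            for x in range(a):
                for y in range(a):
                    match = sample_match(heights[x][y] == j, v, rng)
                    sums[(j, x - c, y - c)] += target * match
    return {key: total / n_patterns for key, total in sums.items()}
```

For v = 1/2 and D = 5 the estimate at zero offset should come out positive within a layer and negative across layers.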


To calculate the analytical weights, we note the following:

(i)
$$ S_{i,\mathbf{k}} = \begin{cases} s & \text{if } (i,\mathbf{k}) \text{ lies on the solution plane} \\ -1 & \text{otherwise.} \end{cases} \tag{3} $$

(ii)
$$ (RL)_{i,\mathbf{k}} = \begin{cases} 1 & \text{if } (i,\mathbf{k}) \text{ lies on the solution plane} \\ 1 \text{ with prob. } p(1) = v^2 + (1-v)^2 & \text{if not on the solution plane} \\ -1 \text{ with prob. } p(-1) = 2v(1-v) & \text{if not on the solution plane.} \end{cases} \tag{4} $$

(iii) If $(i,\mathbf{k})$ is not on the solution plane, we have the ensemble average

$$ \left\langle (RL)_{i,\mathbf{k}} \right\rangle = (1-2v)^2 . \tag{5} $$

(iv) Within the patch a x a, two points $(k_x, k_y)$ and $(k_x+x, k_y+y)$
would lie on the same sub-plane with the probability

$$ P_d = \frac{(a-1-|x|)(a-1-|y|)}{(a-1)^2} , \tag{6} $$

where we have assumed that $(k_x, k_y)$ is at the center of the patch (an
approximation justified by the translational symmetry). To proceed, we
consider two exclusive cases, i = j and i ≠ j.


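(a) Diagonal, i = j:

$$ W_{i,\mathbf{k};\,i,\mathbf{k}+\mathbf{r}} = N_p \left\langle S_{i,\mathbf{k}} (RL)_{i,\mathbf{k}+\mathbf{r}} \right\rangle . \tag{7} $$

Since $S_{i,\mathbf{k}}$ can assume the value s with probability 1/D and the
value -1 with probability (D-1)/D, we have

$$ \left\langle S_{i,\mathbf{k}} (RL)_{i,\mathbf{k}+\mathbf{r}} \right\rangle = \frac{s}{D} \left\langle (RL)_{i,\mathbf{k}+\mathbf{r}} \right\rangle_{S_{i,\mathbf{k}}=s} - \frac{D-1}{D} \left\langle (RL)_{i,\mathbf{k}+\mathbf{r}} \right\rangle_{S_{i,\mathbf{k}}=-1} . \tag{8} $$

When $(i,\mathbf{k})$ lies on a solution plane, the probability that
$(i,\mathbf{k}+\mathbf{r})$ is not on any solution plane would be
$(1-P_d)(1-1/D)$, and we have

$$ \left\langle (RL)_{i,\mathbf{k}+\mathbf{r}} \right\rangle_{S_{i,\mathbf{k}}=s} = (1-P_d)(1-1/D)(1-2v)^2 + 1 - (1-P_d)(1-1/D) . \tag{9} $$

Similarly, we have

$$ \left\langle (RL)_{i,\mathbf{k}+\mathbf{r}} \right\rangle_{S_{i,\mathbf{k}}=-1} = \frac{1}{D}(1-P_d) + \left[ 1 - \frac{1}{D}(1-P_d) \right] (1-2v)^2 . \tag{10} $$

Finally, we have

$$ W_{i,\mathbf{k};\,i,\mathbf{k}+\mathbf{r}} = N_p \left\{ \frac{s+1}{D^2}\, 4v(1-v)(D-1) P_d + \frac{s-D+1}{D} \left[ 1 - 4v(1-v)\left(1 - \frac{1}{D}\right) \right] \right\} . \tag{11} $$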

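(b) i ≠ j. In this case $(i,\mathbf{k})$ and $(j,\mathbf{k}+\mathbf{r})$
cannot lie on the same plane. If $(i,\mathbf{k})$ lies on a solution plane,
the probability for $(j,\mathbf{k}+\mathbf{r})$ to be also on a solution
plane is

$$ P_1 = \frac{1-P_d}{D} . \tag{12} $$

On the other hand, if $(i,\mathbf{k})$ is not on a solution plane, then the
probability that $(j,\mathbf{k}+\mathbf{r})$ would lie on a solution plane is

$$ P_2 = \frac{P_d}{D-1} + \frac{1-P_d}{D} . \tag{13} $$

With these probabilities, we can calculate

$$ \left\langle (RL)_{j,\mathbf{k}+\mathbf{r}} \right\rangle_{S_{i,\mathbf{k}}=s} = P_1 + (1-P_1)(1-2v)^2 \tag{14} $$

and

$$ \left\langle (RL)_{j,\mathbf{k}+\mathbf{r}} \right\rangle_{S_{i,\mathbf{k}}=-1} = P_2 + (1-P_2)(1-2v)^2 . \tag{15} $$

Therefore when i ≠ j, we have

$$ W_{i,\mathbf{k};\,j,\mathbf{k}+\mathbf{r}} = N_p \left\{ -\frac{s+1}{D^2}\, 4v(1-v) P_d + \frac{s-D+1}{D} \left[ 1 - 4v(1-v)\left(1 - \frac{1}{D}\right) \right] \right\} . \tag{16} $$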


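Combining Eqs. (11) and (16), we have for general i and j,

$$ W_{i,\mathbf{k};\,j,\mathbf{k}+\mathbf{r}} = N_p \frac{s+1}{D^2}\, 4v(1-v)(D\delta_{ij}-1) P_d + N_p \frac{s-D+1}{D} \left[ 1 - 4v(1-v)\left(1 - \frac{1}{D}\right) \right] . \tag{17} $$

The second term in Eq. (17) vanishes if we choose s = D - 1. Substituting
the expression of $P_d$ from Eq. (6), we get

$$ W_{i,\mathbf{k};\,j,\mathbf{k}'} = N_p \frac{4v(1-v)}{(a-1)^2} \left(a-1-|k_x-k'_x|\right) \left(a-1-|k_y-k'_y|\right) \left(\delta_{ij} - \frac{1}{D}\right) , \tag{18} $$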


where $(k_x, k_y)$ is chosen as the center of a patch a x a and
$(k'_x, k'_y)$ runs through the whole patch. It is easily seen that this
weight matrix is symmetrical between i and j. This is a consequence of the
translation and reflection symmetry in our training patterns. The last
factor in Eq. (18) also shows that the weights between neurons on the same
layer are excitatory, while those between neurons on different layers are
inhibitory. This is a direct confirmation of the continuity and the
uniqueness constraints proposed by Marr and Poggio.
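
The sign structure can be checked with a short sketch. It reads the weight for s = D - 1 as N_p · 4v(1-v) · P_d · (δ_ij − 1/D), with P_d taken from Eq. (6); this closed form is a reconstruction from the derivation, and the function names and default parameter values are illustrative assumptions.

```python
def same_plane_probability(x, y, a):
    """P_d of Eq. (6): chance that two pixels offset by (x, y) share a sub-plane."""
    px = max(0, a - 1 - abs(x))
    py = max(0, a - 1 - abs(y))
    return px * py / (a - 1) ** 2

def analytical_weight(i, j, x, y, a=5, depth=5, v=0.5, n_p=1.0):
    """Weight between layer i at pixel k and layer j at pixel k + (x, y), s = D - 1."""
    delta = 1.0 if i == j else 0.0
    p_d = same_plane_probability(x, y, a)
    return n_p * 4 * v * (1 - v) * p_d * (delta - 1.0 / depth)
```

Within a layer the factor (1 − 1/D) is positive, so neighboring pixels at the same depth support each other (continuity); across layers the factor −1/D is negative, so competing depths along a line of sight suppress each other (uniqueness). Summed over the D layers j at a fixed offset, the weights cancel.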


III.  CONVERGENCE  THEOREM 


Hopfield was the first to show the convergence of an asynchronous
symmetrical recursive network with the help of a Liapunov function. In our
problem, the weights are symmetrical and we can implement the dynamics of
our network with two schemes: a) the maximum scheme and b) the threshold
scheme. Their performance is similar. In order to enhance the stability of
our result, we also add a forcing term from the input to our network, which
can be integrated into the convergence theorem easily.

The maximum scheme evolves according to

$$ X^{t+1}_{i,\mathbf{k}} = \max_{1 \le i \le D} \left[ \sum_{j,\mathbf{k}'} W_{i,\mathbf{k};\,j,\mathbf{k}'} X^{t}_{j,\mathbf{k}'} + \alpha X^{0}_{i,\mathbf{k}} \right] , \tag{19} $$

where

$$ \max_{1 \le i \le D} (y_i) = \begin{cases} 1 & \text{if } y_i = \max(y_j;\ j = 1, \ldots, D) \\ -1 & \text{otherwise} \end{cases} $$

and $X^{0}$ is the input state of the neurons. Following Goles and Vichniac,
we can show that this scheme is convergent with the help of the Liapunov
function,

$$ E_M(X^{t}, X^{t+1}) = \sum_{i,\mathbf{k}} \sum_{j,\mathbf{k}'} X^{t+1}_{i,\mathbf{k}} W_{i,\mathbf{k};\,j,\mathbf{k}'} X^{t}_{j,\mathbf{k}'} + \alpha \sum_{i,\mathbf{k}} \left( X^{t+1}_{i,\mathbf{k}} + X^{t}_{i,\mathbf{k}} \right) X^{0}_{i,\mathbf{k}} , \tag{20} $$

where $X^{t}$ and $X^{t+1}$ are the neuron states at time step t and t+1,
respectively. It is then straightforward to show that the updating
monotonically increases the Liapunov function (20) and therefore guarantees
its convergence.

The advantage of the maximum scheme is its simplicity. It ensures
that the uniqueness constraint is satisfied strictly. However, since we
are interested in having the network learn to implement the uniqueness
constraint automatically, we are also interested in the threshold scheme,
which takes the mutual inhibition of the pixels along the line of sight into
account. The threshold scheme evolves according to

$$ X^{t+1}_{i,\mathbf{k}} = \theta \left[ \sum_{j,\mathbf{k}'} W_{i,\mathbf{k};\,j,\mathbf{k}'} X^{t}_{j,\mathbf{k}'} + \alpha X^{0}_{i,\mathbf{k}} + t \right] , $$

where θ is the step function defined as

$$ \theta(x) = \begin{cases} 1 & x > 0 \\ -1 & x < 0 \end{cases} $$

and t is a threshold value. This scheme is also ensured of convergence.
For synchronous updating, the Liapunov function is very similar to Eq. (20)
and is given by

$$ E_T(X^{t}, X^{t+1}) = \sum_{i,\mathbf{k}} \sum_{j,\mathbf{k}'} X^{t+1}_{i,\mathbf{k}} W_{i,\mathbf{k};\,j,\mathbf{k}'} X^{t}_{j,\mathbf{k}'} + \sum_{i,\mathbf{k}} \left( X^{t+1}_{i,\mathbf{k}} + X^{t}_{i,\mathbf{k}} \right) \left( \alpha X^{0}_{i,\mathbf{k}} + t \right) . $$

Again, we can show that $\Delta E_T > 0$ strictly. Both the maximum and the
threshold schemes have been implemented and show similar performance.
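
One synchronous update of each scheme can be sketched as follows. This is a hedged illustration, not the authors' implementation: the weight is taken proportional to P_d(δ_ij − 1/D) with the prefactor set to 1, the window covers offsets up to a // 2 from the pixel, pixels outside the image are ignored, ties in the maximum scheme go to the lowest layer, and all function names are hypothetical.

```python
def weight(i, j, x, y, a, depth):
    """Local weight, proportional to P_d * (delta_ij - 1/D); prefactor set to 1."""
    p_d = (a - 1 - abs(x)) * (a - 1 - abs(y)) / (a - 1) ** 2
    return p_d * ((1.0 if i == j else 0.0) - 1.0 / depth)

def local_fields(state, inputs, a, alpha):
    """Field sum_{j,k'} W X^t + alpha X^0 per neuron; state[i][kx][ky] is +1 or -1."""
    depth, n, half = len(state), len(state[0]), a // 2
    fields = [[[alpha * inputs[i][kx][ky] for ky in range(n)] for kx in range(n)]
              for i in range(depth)]
    for i in range(depth):
        for kx in range(n):
            for ky in range(n):
                total = 0.0
                for j in range(depth):
                    for x in range(-half, half + 1):
                        for y in range(-half, half + 1):
                            if 0 <= kx + x < n and 0 <= ky + y < n:
                                total += (weight(i, j, x, y, a, depth)
                                          * state[j][kx + x][ky + y])
                fields[i][kx][ky] += total
    return fields

def maximum_step(state, inputs, a=5, alpha=0.0):
    """Maximum scheme: at each pixel only the layer with the largest field is +1."""
    f = local_fields(state, inputs, a, alpha)
    depth, n = len(state), len(state[0])
    new = [[[-1] * n for _ in range(n)] for _ in range(depth)]
    for kx in range(n):
        for ky in range(n):
            # max() returns the first maximal layer, so ties go to the lowest layer
            winner = max(range(depth), key=lambda i: f[i][kx][ky])
            new[winner][kx][ky] = 1
    return new

def threshold_step(state, inputs, a=5, alpha=0.0, t=0.0):
    """Threshold scheme: each neuron is +1 when its field plus t is positive."""
    f = local_fields(state, inputs, a, alpha)
    return [[[1 if val + t > 0 else -1 for val in row] for row in plane]
            for plane in f]
```

In this sketch a flat surface is a fixed point of both updates, and a single misplaced dot on an otherwise flat surface is removed in one step.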

IV.  STATIONARY  STATES  AND  THE  NUMERICAL  IMPLEMENTATION 

After proving the convergence theorem, the next question to ask is
where the network converges to. Unfortunately, a complete discussion of the
attractor states is very tedious and is not the main interest of this
paper. We only state that consideration of the following questions helps us
find the optimal choice of the few parameters remaining in our network,
namely the threshold value, the strength of the forcing by the initial
pattern, and the size a of our training patch.

The  questions  involved  are: 

(i) Are the internal cells stationary? An internal cell is a pixel
cell whose immediate neighbors are occupied by cells with the same value.

(ii) Are the cells at a boundary stationary? As one solution
surface meets another solution surface at a boundary, this boundary must
be stationary.

(iii) Are the corners stationary? This is a tougher condition to
meet than (ii).

(iv) Are isolated dots eliminated? Isolated dots are false matches
and should be removed.

These considerations lead us to the choice of 3 < a < 13.

We have run many numerical simulations of our networks with the above
a values. Typically, three or four iterations are sufficient to attain the
final stationary state (see Fig. 2). Sometimes, merely one or two
iterations already give an almost perfect result. The remaining iterations
serve only to remove a few isolated false spots. Among the results obtained
with different patch sizes and weights, the one obtained with a = 5 has the
cleanest shape and sharpest corners.

We have also tested our network with 3-D images other than rectangles,
for instance, triangles and octagons. The learned weights generalize well
for these cases (see Fig. 3). We also tested the robustness of the network
by implementing the numerically acquired weights. The performance is as
good as that with the analytical weights.

V.  CONCLUSION  AND  DISCUSSIONS 

We demonstrated in this paper that, using a high order recursive network,
the connection weights can be learned automatically from Hebb's rule to
perform stereopsis on random dot stereograms. Hebb's rule is linear, which
allows us to construct the weights analytically. The results showed
that the continuity and the uniqueness constraints first proposed by Marr
and Poggio are learned automatically. The weights obtained are symmetrical
and therefore guarantee the convergence of the network. Numerical
implementations confirmed these predictions.

Figure 1: The random dot stereo vision. Two images of the left and right eyes; panels (a) to (d).

[Figure: Neuronal network of stereo vision. The 5 layers (left to right, i.e. bottom to top); result of iteration 0.]

ABSTRACT

We propose a new scheme to construct neural networks to classify
patterns. The new scheme has several novel features:

1. We focus attention on the important attributes of patterns in ranking
order, extracting the most important ones first and the less important
ones later.

2. In training we use the information as a measure instead of the error
function.

3. A multi-perceptron-like architecture is formed automatically. Decisions
are made according to the tree structure of learned attributes.

This new scheme is expected to self-organize and perform well in large scale
problems.

1 INTRODUCTION

It is well known that the two-layered perceptron with binary connections but
no hidden units is unsuitable as a classifier due to its limited power [1].
It cannot solve even the simple exclusive-or problem. Two extensions have
been proposed to remedy this problem. The first is to use higher order
connections [2]. It has been demonstrated that high order connections could
in many cases solve the problem with speed and high accuracy [3], [4]. The
representations in general are more local than distributive. The main
drawback, however, is the combinatorial explosion of the number of
high-order terms. Some kind of heuristic judgement has to be made in the
choice of the terms to be represented in the network.

A second proposal is the multi-layered binary network with hidden units
[5]. These hidden units function as features extracted from the bottom input
layer to facilitate the classification of patterns by the output units. In
order to train the weights, learning algorithms have been proposed that
back-propagate the errors from the visible output layer to the hidden layers
for eventual adaptation to the desired values. The multi-layered networks
enjoy great popularity for their flexibility.

However, there are also problems in implementing the multi-layered nets.
Firstly, there is the problem of allocating the resources, namely, how many
hidden units would be optimal for a particular problem. If we allocate too
many, it is not only wasteful but could also negatively affect the
performance of the network: too many hidden units imply too many free
parameters that fit the training patterns specifically, so the ability to
generalize to novel test patterns would be adversely affected. On the other
hand, if too few hidden units were allocated, then the network would not
have the power even to represent the training set. How could one judge
beforehand how many are needed in solving a problem? This is similar to the
problem encountered in the high order net in its choice of high order terms
to be represented.

Secondly, there is also the problem of scaling up the network. Since the
network represents a parallel or cooperative process of the whole system,
each added unit would interact with every other unit. This would become
a serious problem when the size of our patterns becomes large.

Thirdly, there is no sequential communication among the patterns in the
conventional network. To accomplish a cognitive function we would need
the patterns to interact and communicate with each other as human
reasoning does. It is difficult to envision such an interaction in current
systems, which are basically input-output mappings.


2  THE  NEW  SCHEME 

In this paper, we propose a scheme that constructs a network taking
advantage of both the parallel and the sequential processes.

We note that in order to classify patterns, one has to extract the intrinsic
features, which we call attributes. For a complex pattern set, there may
be a large number of attributes. But different attributes may have different
rankings of importance. Instead of extracting them all simultaneously it may
be wiser to extract them sequentially in order of their importance [6], [7].
Here the importance of an attribute is determined by its ability to
partition the pattern set into sub-categories. A measure of this ability of
a processing unit should be based on the extracted information. For
simplicity, let us assume that there are only two categories so that the
units have only binary output values 1 and 0 (but the input patterns may
have analog representations). We call these units, including their
connection weights to the input layer, nodes. For given connection weights,
the patterns that are classified by a node as in category 1 may have their
true classifications either 1 or 0. Similarly, the patterns that are
classified by a node as in category 0 may also have their true
classifications either 1 or 0. As a result, four groups of patterns are
formed: (1,1), (0,0), (1,0), (0,1). We then need to judge the efficiency of
the node by its ability to split these patterns optimally. To do this we
shall construct the impurity functions for the node. Before splitting, the
impurity of the input patterns reaching the node is given by

$$ I_b = -P^{b}_{1} \log P^{b}_{1} - P^{b}_{0} \log P^{b}_{0} \tag{1} $$

where $P^{b}_{1} = N^{b}_{1}/N$ is the probability of being truly classified
as in category 1, and $P^{b}_{0} = N^{b}_{0}/N$ is the probability of being
truly classified as in category 0. After splitting, the patterns are
channelled into two branches, and the impurity becomes

$$ I_a = -P^{a}_{1} \sum_{j=0,1} P(j,1) \log P(j,1) - P^{a}_{0} \sum_{j=0,1} P(j,0) \log P(j,0) \tag{2} $$

where $P^{a}_{1} = N^{a}_{1}/N$ is the probability of being classified by the
node as in category 1, $P^{a}_{0} = N^{a}_{0}/N$ is the probability of being
classified by the node as in category 0, and $P(j,i)$ is the probability of
a pattern which should be in category j but is classified by the node as in
category i. The difference

$$ \Delta I = I_b - I_a \tag{3} $$

represents the decrease of the impurity at the node after splitting. It is
the quantity that we seek to optimize at each node. The logarithm in the
impurity function comes from the information entropy of Shannon and Weaver.
For all practical purposes, we found the optimization of (3) to be the same
as maximizing the entropy [6]

where $N_i$ is the number of training patterns classified by the node as in
category i, and $N_{ij}$ is the number of training patterns with true
classification in category i but classified by the node as in category j.
Later we shall call the terms in the first bracket $S_1$ and the second
$S_2$. Obviously, we have

After the first unit is trained, the training patterns are split into two
branches by the unit. If the classification in either one of these two
branches is pure enough, or equivalently either one of $S_1$ and $S_2$ is
fairly close to 1, then we terminate that branch (or branches) as a leaf of
the decision tree, and classify the patterns as such. On the other hand, if
either branch is not pure enough, we add an additional node to split the
pattern set further. The subsequent unit is trained with only those patterns
channeled through this branch. These operations are repeated until all the
branches are terminated as leaves.
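
The impurity decrease that drives each split can be illustrated with a small sketch. It assumes that P(j, i) is the fraction of the patterns sent to branch i whose true category is j, which is one reading of the definition above; the function names and the count layout are hypothetical.

```python
import math

def _plogp(p):
    """Return -p log p, with the usual convention that 0 log 0 = 0."""
    return -p * math.log(p) if p > 0 else 0.0

def impurity_decrease(counts):
    """Impurity before the split minus impurity after it.

    counts[j][i] is the number of patterns with true category j that the
    node sends to branch i (j and i are 0 or 1).
    """
    n = sum(sum(row) for row in counts)
    # Impurity before splitting uses the true-category fractions.
    before = sum(_plogp(sum(counts[j]) / n) for j in (0, 1))
    after = 0.0
    for i in (0, 1):
        n_branch = counts[0][i] + counts[1][i]
        if n_branch == 0:
            continue  # an empty branch contributes nothing
        # Entropy inside branch i, weighted by the share of patterns it receives.
        branch = sum(_plogp(counts[j][i] / n_branch) for j in (0, 1))
        after += (n_branch / n) * branch
    return before - after
```

A node that separates two balanced categories perfectly removes all of the impurity, while a node whose branches keep the original mixture removes none.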


3  LEARNING  ALGORITHM 

We used the stochastic gradient descent method to learn the weights of each
node. The training set for each node consists of those patterns channeled to
this node. As stated in the previous section, we seek to maximize the
entropy function S. The learning of the weights is therefore conducted
through

$$ \Delta W_j = \eta \frac{\partial S}{\partial W_j} \tag{5} $$

where η is the learning rate. The gradient of S can be calculated from the
following equation

i*  =  I  r(1  _  +  (1  _  u. 

d\V,  N  V  ~  N?  d\V,  {  ~N}’d\V,  ' 

Using analog units


we have

$$ \frac{\partial O^{r}}{\partial W_j} = O^{r} (1 - O^{r}) I^{r}_{j} $$

Furthermore, let $A^{r}$ = 1 or 0 be the true answer for the input pattern
r; then

$$ N_{ij} = \sum_{r=1}^{N} \left[ i A^{r} + (1-i)(1-A^{r}) \right] \left[ j O^{r} + (1-j)(1-O^{r}) \right] \tag{9} $$

Substituting  these  into  equation  (5),  we  get 

Air,  =  2,£[2.4'(|i-^)  +  ^-^]o-(i-o-);;  (10) 

In applying the formula (10), instead of calculating the whole summation at
once, we update the weights for each pattern individually. Meanwhile we
update $N_{ij}$ in accord with equation (9).
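
The two ingredients that survive legibly in this section, the derivative of an analog unit and the soft counts of Eq. (9), can be sketched as follows. The logistic form of the unit is an assumption, chosen because it is consistent with the derivative O(1 − O)I quoted above; the function names are hypothetical, and the gradient of S itself is not reproduced here.

```python
import math

def sigmoid_output(weights, pattern):
    """Analog output O in (0, 1) of a single node (logistic unit, assumed)."""
    activation = sum(w * x for w, x in zip(weights, pattern))
    return 1.0 / (1.0 + math.exp(-activation))

def output_gradient(weights, pattern):
    """dO/dW_j = O (1 - O) I_j for every weight j."""
    o = sigmoid_output(weights, pattern)
    return [o * (1.0 - o) * x for x in pattern]

def soft_counts(answers, outputs):
    """N[i][j] of Eq. (9): true category i, node output j, analog outputs in [0, 1]."""
    counts = [[0.0, 0.0], [0.0, 0.0]]
    for a, o in zip(answers, outputs):
        for i in (0, 1):
            for j in (0, 1):
                truth = i * a + (1 - i) * (1 - a)   # 1 when the true answer is i
                fired = j * o + (1 - j) * (1 - o)   # soft membership in branch j
                counts[i][j] += truth * fired
    return counts
```

When the outputs are exactly 0 or 1 the soft counts reduce to the ordinary confusion counts, and in every case the four entries add up to the number of patterns.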

Figure 1: The given classification tree, where θ1, θ2 and θ3 are chosen to be
all zeros in the numerical example.

4  AN  EXAMPLE 


To illustrate our method, we construct an example which is itself a decision
tree. Assuming there are three hidden variables a1, a2, a3, a pattern is
given by a ten-dimensional vector I1, I2, ..., I10 constructed from the three
hidden variables as follows

A given pattern is classified as either 1 (yes) or 0 (no) according to the
corresponding values of the hidden variables a1, a2, a3. The actual decision
is derived from the decision tree in Fig. 1.

In order to learn this classification tree, we construct a training set of
5000 patterns generated by randomly chosen values a1, a2, a3 in the interval
-1 to +1. We randomly choose the initial weights for each node, and
terminate a branch as a leaf whenever the branch entropy is greater than
0.80. The entropy started at S = 0.65 and terminated at its maximum value
S = 0.79 for the first node. The two branches of this node have the entropy
function valued at S1 = 0.61 and S2 = 0.87 respectively. This corresponds to
2446 patterns channeled to the first branch and 2554 to the second. Since
S2 > 0.80 we terminate the second branch. Among the 2554 patterns channeled
to the second branch there are 2519 patterns with true classification no and
35 with yes, which are considered as errors. After completing the whole
training process, there are in total four nodes automatically introduced.
The final result is shown as a tree structure in Fig. 2.

Figure 2: The learned classification tree structure

The total error of the learned tree is 3.4 % of the 5000 training patterns.
After training we tested the result using 10000 novel patterns, among which
the error is 3.2 %.

5 SUMMARY

We propose here a new scheme to construct a neural network that can
automatically learn the attributes sequentially to facilitate the
classification of patterns according to the ranking importance of each
attribute. This scheme uses information as a measure of the performance of
each unit. It is self-organized into a presumably optimal structure for a
specific task. The sequential learning procedure focuses the attention of
the network on the most important attribute first and then branches out to
the less important attributes. This strategy of searching for attributes
would alleviate the scale up problem faced by the overall parallel
back-propagation scheme. It also avoids the problem of resource allocation
encountered in the high-order net and the multi-layered net. In the example
we showed, the performance of the new method is satisfactory. We expect much
better performance in problems that demand a large number of units.

ABSTRACT

We present here a new scheme to automatically construct a neural network
architecture that takes advantage of both the parallel and sequential
strategies to solve a pattern classification or decision problem. The new
scheme optimizes an entropy measure to train nodes that extract attributes
from the training patterns. The sequential extraction of attributes with
ranking order could alleviate significantly the scale up problem of an all
parallel network. Examples of decision tree problems demonstrate amply the
superior performance of PSIN (Parallel Sequential Induction Network) against
the usual back propagation procedure in multi-layered networks.

1  Introduction 

When we make a decision on a certain task, we usually have to evaluate a
number of factors, called attributes here. These attributes, in general,
have different importance in helping us to make the decision. As a matter of
fact, some of the attributes are independent of each other and can be
evaluated at the same time, i.e. in parallel, while other attributes may
have to wait until some preliminary decision based on more important
attributes on a subtask has already been made. This combination of parallel
and sequential strategies in the decision processes can be very efficient
indeed [1], [2]. On the other hand, we note that the neural networks studied
so far employ mostly the parallel strategy [3]. These are just input-output
mappings, with the neural network serving the role of a nonlinear mapping
function. There are two major limitations of this approach. First, it is
difficult to identify the proper network structure for a given task. Second,
the scaling up problem and the associated low learning speed are serious
drawbacks of these networks. These limitations could be alleviated by using
a new scheme proposed in this article [4], which we call the parallel
sequential induction network (PSIN). PSIN takes advantage of both the
parallel and sequential strategies in solving a problem. It consists of many
nodes that classify the incoming patterns into a few subcategories. Each
node in PSIN is itself a neural network with one or more output neurons. It
is important to notice that we do not expect a single node alone to
accomplish the entire decision or classification job. It is the combination
of many such nodes that complement each other and in the end cooperatively
get the job done.


2  The  parallel  sequential  induction  network 


In a conventional neural network, training (or testing) patterns are presented
to an input layer and the results read off from the output units. The adaptation
of the connection weights is the consequence of all the training patterns because
of the overall parallel strategy. However, it can be seen that this strategy may not
be the best for problems represented by a multi-branched decision tree. An example
is shown in Fig. 1, where a1, a2, a3, ... are attributes to be tested at the
corresponding nodes 1, 2, 3, ... .


Fig.  1 


Suppose there are only two pattern classes, yes and no. For an input pattern, if
its test result in node 1 is positive, it is channeled to node 2 to test for attribute
a2. If the result is again positive we classify the pattern as yes, and so on. We note
that in order to classify a pattern, not all attributes are tested (or relevant). For
some patterns a1 and a2 are the determinative attributes and a3 is irrelevant. But
for other patterns a1 and a3 are important and a2 is irrelevant.


If we expect a three layered network to be able to extract automatically the
three attributes a1, a2, and a3, we are actually assuming that the input patterns
are uniform in their regularity, which however is not true. This problem gets
dramatically worse as the depth of the decision tree is increased. The number of
relevant attributes becomes much smaller than the number of irrelevant ones, and
their voting power in a decision process could easily be swamped by noise signals
from those irrelevant attributes in an all parallel arrangement.

To  remedy  this  problem,  or  to  recognize  that  some  attributes  are  important  for 
a  fraction  of  the  patterns  but  not  at  all  for  others,  we  propose  the  parallel  sequential 
induction  network.  The  PSIN  divides  the  task  of  classification  of  patterns  into  steps. 
The first step is to construct a node that extracts the most important attributes for
the largest fraction of patterns from all the input patterns. We use information
entropy to measure the quality of this node. After training, we expect the node
to do its best to classify the patterns. However, usually what the first node could
accomplish is only to purify the patterns in each branch. We set a criterion for
the  purity.  If  the  purity  in  a  branch  is  lower  than  the  criterion,  we  use  the  subset 
of  patterns  that  were  channeled  into  the  branch  by  the  first  node  to  train  another 
node.  We  again  maximize  the  information  gain  and  purify  the  classification  of  the 
subset  of  patterns.  This  procedure  of  branching  and  purification  is  continued  until 
a  satisfactory  performance  is  achieved. 
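
The branching and purification procedure described above can be sketched as a short recursive routine. This is a minimal illustration under stated assumptions, not the authors' implementation: purity is taken to be the share of the majority class in a branch, the criterion value 0.99 is arbitrary, and `train_stump` is a hypothetical stand-in that tests the sign of a single input rather than training a perceptron-like node on the information measure.

```python
import math
import random


def entropy(labels):
    """Shannon entropy (natural log) of a list of 0/1 class labels."""
    p = sum(labels) / len(labels)
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)


def majority_fraction(labels):
    """Assumed purity measure: share of the majority class in a branch."""
    yes = sum(labels)
    return max(yes, len(labels) - yes) / len(labels)


def split(patterns, labels, route):
    """Channel each pattern into the branch chosen by the node's routing function."""
    branches = {}
    for x, y in zip(patterns, labels):
        bx, by = branches.setdefault(route(x), ([], []))
        bx.append(x)
        by.append(y)
    return branches


def train_stump(patterns, labels):
    """Stand-in node trainer (hypothetical): sign test on the single input
    that leaves the least size-weighted entropy in its two branches."""
    def impurity_after(k):
        parts = split(patterns, labels, lambda x: int(x[k] > 0))
        return sum(len(by) * entropy(by) for _, by in parts.values())
    best = min(range(len(patterns[0])), key=impurity_after)
    return lambda x: int(x[best] > 0)


def build_psin(patterns, labels, train_node, criterion=0.99, depth=0, max_depth=5):
    """Train a node, then train further nodes on every branch below the criterion."""
    majority = int(2 * sum(labels) >= len(labels))
    if majority_fraction(labels) >= criterion or depth == max_depth:
        return majority                          # leaf: a class label
    route = train_node(patterns, labels)
    branches = split(patterns, labels, route)
    if len(branches) < 2:                        # the node separated nothing
        return majority
    children = {b: build_psin(bx, by, train_node, criterion, depth + 1, max_depth)
                for b, (bx, by) in branches.items()}
    return route, children, majority


def classify(tree, x):
    """Follow the nodes until a leaf label is reached."""
    while not isinstance(tree, int):
        route, children, majority = tree
        tree = children.get(route(x), majority)
    return tree


# Toy problem (an assumption for illustration): yes only if inputs 0 and 1 are both positive.
rng = random.Random(0)
train = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(400)]
labels = [int(x[0] > 0 and x[1] > 0) for x in train]
tree = build_psin(train, labels, train_stump)
novel = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(400)]
accuracy = sum(classify(tree, x) == int(x[0] > 0 and x[1] > 0) for x in novel) / len(novel)
print(accuracy)
```

In this toy case the first node leaves one branch pure and one mixed, and only the mixed branch receives a second node, which mirrors the two-level structure of the decision tree it was generated from.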


3  Training  of  nodes 


The objective of a node is to purify the classification of the patterns that come to
this node. The purity of the patterns is measured by an information function.
Before entering the node, we assume there are $N^+ + N^- = N$ patterns,
where $N^+$ is the number of yes patterns and $N^-$ is the number of no patterns. The
information entropy function (or impurity function) that characterizes the impurity
of these patterns is given by


$$S_b = \frac{N^+}{N}\log\frac{N^+}{N} + \frac{N^-}{N}\log\frac{N^-}{N} \tag{1}$$

using  the  Shannon- Weaver  entropy.  In  practice,  we  approximate  the  above  with  a 
simpler  function 

A node is in principle a neural network by itself. It consists of an input layer to
which the input patterns are presented and an output layer that may contain one
or more output units. Since one of the strategies of PSIN is to shift the burden
of the classification task from a single network to a series of smaller networks (or
nodes), each node of PSIN may be constructed using the simple perceptron
architecture instead of other more complex networks. However, we should note that
the training of the connection weights in this perceptron-like node does not use
the error correction scheme.

A node consisting of n output neurons will channel the patterns into $2^n$ possible
branches. The impurity or entropy function after the classification by the node is
given by


sa 


(3) 


where $j_i$ (= 0 or 1) is the quantized output of the ith neuron, and
$N^+_{j_1 j_2 \ldots j_n}$ is the number of yes patterns channelled into the branch
with the ith neuron having the quantized output $j_i$ ($i = 1, 2, \ldots, n$). The
information gain or the increase of purity by the node is then


$$\Delta S = S_a - S_b. \tag{4}$$


Since the node cannot change $S_b$, the training can only alter $S_a$, and we adapt
the weights to maximize $S_a$.
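
As a rough illustration of the quantities above, the following sketch computes the gain of one node from its analog outputs. Several points are assumptions rather than the paper's definitions: S is taken with the sign convention in which a larger value means a purer set (so that the gain S_a - S_b is non-negative and S_a is to be maximized), S_a is taken to be the branch-size-weighted sum of the same two-class expression, and outputs are quantized at a threshold of 0.5. The names `purity`, `branch_of`, and `information_gain` are hypothetical.

```python
import math
from collections import Counter


def purity(n_yes, n_no):
    """Sum of p*log(p) over the two classes: 0 for a pure set, -log 2 for an even mix."""
    n = n_yes + n_no
    return sum((k / n) * math.log(k / n) for k in (n_yes, n_no) if k > 0)


def branch_of(outputs, threshold=0.5):
    """Quantize n analog outputs into bits (j_1, ..., j_n), one of 2**n branches."""
    return tuple(int(o > threshold) for o in outputs)


def information_gain(node_outputs, labels):
    """Gain S_a - S_b of one node; labels are 1 for yes and 0 for no."""
    n = len(labels)
    s_b = purity(sum(labels), n - sum(labels))
    yes, total = Counter(), Counter()
    for outputs, label in zip(node_outputs, labels):
        b = branch_of(outputs)
        total[b] += 1
        yes[b] += label
    # Assumed form: each branch contributes its purity weighted by its share of patterns.
    s_a = sum(total[b] / n * purity(yes[b], total[b] - yes[b]) for b in total)
    return s_a - s_b


labels = [1, 1, 0, 0]
separating = [(0.9,), (0.8,), (0.2,), (0.1,)]   # one neuron, splits yes from no
useless = [(0.9,), (0.9,), (0.9,), (0.9,)]      # one neuron, everything in one branch
print(information_gain(separating, labels))     # log 2: both branches become pure
print(information_gain(useless, labels))        # zero: S_a equals S_b
```

Under these assumptions the gain is bounded above by log 2 for a two-class problem, which is reached exactly when every occupied branch is pure.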

It  is  easy  to  show  that 
Note also that

where AT = 0 or 1, representing yes or no respectively, is the true classification of
the pattern r and $O_i^r$ is the analog output of the ith neuron for the pattern r. We
have

$$O_i^r = \frac{1}{1 + \exp(\ldots)} \tag{8}$$
and

Therefore, a stochastic gradient descent algorithm can be constructed with

where η is the learning rate and the summation over training patterns is dropped.


4  Result 

We constructed a few examples and tested the PSIN on them against back
propagation on a three layered net. PSIN proved superior to back propagation
in all these cases.

These  examples  use  three  attributes  constructed  from  ten  input  units 


ai  —  +  l.ot'2  0.5u3  -f  0.4t\»  +  0.3u5  +  t’6  —  fr  +  2t>g  —  vg  +  0.9i‘io 

$$a_2 = 0.1v_1 - 0.2v_2 + 0.3v_3 - 0.4v_4 + 0.5v_5 - 0.5v_6 - 0.4v_7 - 0.3v_8 + 0.2v_9 - 0.1v_{10}$$


All these are analog units. The components of patterns $v_j$, j = 1 to 10, are generated
randomly and uniformly between -1 and 1. In learning these cases, we use 5000
randomly  generated  training  patterns.  After  learning  we  use  another  set  of  5000 
novel  patterns  to  test  the  result. 

(i)  Second  Order  XOR  Problem 

This problem is represented in the decision tree in Fig. 2. Using a node with a single
output neuron, PSIN cannot improve the purity of this node by any significant
amount. This is expected because we know that the XOR problem cannot be solved by
a perceptron with a single output unit. However, if we use two output neurons in the
node, then the PSIN learns the problem immediately, in only 12 seconds of CPU time
on a Cray-XMP, with an accuracy of 99.85% on 5000 novel testing patterns.

Fig. 2
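
The contrast between one and two output neurons can be illustrated numerically. The sketch below is only an illustration under assumptions: the weight vectors `W1` and `W2` are hypothetical (they are not the paper's coefficients and were chosen to depend on disjoint inputs), "second order XOR" is read as "yes if exactly one of the two attributes is positive", and the output units are idealized sign tests aligned with the attributes rather than trained neurons.

```python
import random

# Hypothetical attribute weights over ten inputs (NOT the paper's coefficients).
W1 = [1.0, 0.5, -0.5, 0.4, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0]
W2 = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 2.0, -1.0, 0.9]


def attribute(w, v):
    """Linear attribute: weighted sum of the input components."""
    return sum(wi * vi for wi, vi in zip(w, v))


def make_patterns(n, seed):
    """Patterns with components uniform in [-1, 1], labelled by the assumed XOR rule."""
    rng = random.Random(seed)
    patterns = [[rng.uniform(-1.0, 1.0) for _ in range(10)] for _ in range(n)]
    labels = [int((attribute(W1, v) > 0) != (attribute(W2, v) > 0)) for v in patterns]
    return patterns, labels


def yes_fraction_per_branch(patterns, labels, weight_rows):
    """Route patterns by the sign bits of the given units; return the yes share per branch."""
    counts = {}
    for v, y in zip(patterns, labels):
        branch = tuple(int(attribute(w, v) > 0) for w in weight_rows)
        yes, total = counts.get(branch, (0, 0))
        counts[branch] = (yes + y, total + 1)
    return {b: yes / total for b, (yes, total) in sorted(counts.items())}


patterns, labels = make_patterns(5000, seed=0)
one_output = yes_fraction_per_branch(patterns, labels, [W1])
two_outputs = yes_fraction_per_branch(patterns, labels, [W1, W2])
print(one_output)    # both branches stay close to an even yes/no mix
print(two_outputs)   # each of the four branches is entirely yes or entirely no
```

With a single unit aligned to one attribute, each branch still holds roughly half yes and half no patterns, so nothing has been purified; with two units the four branches are pure and the class can be read off directly.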

On the other hand, using a three layered network with back propagation, we found
that the results depend significantly on the number of hidden neurons. Using two
hidden neurons, the error is as high as 25.67% after training for 10 seconds of CPU
time. The result with three hidden neurons does not show
improvement.  The  error  is  still  25.98%  af¬ 
ter  a. 7. 2  seconds  training  time.  Increasing 
the  hidden  neuron  number  to  10,  the  er¬ 
ror  is  reduced  to  2.25%  which  is  still  15 
times  worse  than  the  PSIN  result  and  the 
training  time  is  69.7  seconds  or  about 
seven  times  longer  than  the  PSIN  method. 

(ii) Third Order XOR

This problem is represented in the decision tree in Fig. 3. The PSIN approach
starts from the premise that the problem may be solved using a node with a single
output neuron. This being impossible, the program automatically increases the
number of output neurons in the node by one to try to improve its information
gain. In the end, a node with three output neurons solved the problem with an
accuracy of 98.9%. The CPU time used, including all the computations from the
one-neuron node to the three-neuron node, is 39 seconds on a Cray-XMP. If we knew
beforehand that we should use a three-neuron node, this time would be cut by at
least half. On the other hand, the automatic search for the right number of neurons
to be included in a node is also one of the desirable features of the PSIN approach.

Fig. 3

The same problem run on a three layered network with back propagation has
an error of 20% for three hidden neurons and 4.6% for six hidden neurons, and the
CPU time spent is already 44.5 seconds. Increasing the hidden neurons to 10 results
in a higher error rate of 5.5 ~ 7%, and the CPU time needed is around 70 seconds. We
could not further reduce the error of the back propagation scheme by including more
hidden neurons.

(iii) A Third Example

As another example, we consider the decision tree in Fig. 4. The PSIN approach
automatically formed a two-node structure to solve this problem. The structure
closely resembles the decision tree. The first node consists of one output neuron
and the second consists of two output neurons. The trained network has an error
rate of 0.75%. The time spent is 11.75 seconds. The back propagation scheme
on a three layered net with three hidden neurons has an error of 7.45%. The time
spent is 22.7 seconds.

Fig. 4

5  Conclusion  and  discussion 

In  this  paper,  we  have  presented  a  scheme  called  Parallel  Sequential  Induction 
Network  to  construct  automatically  a  tree  of  neural  network  nodes.  Each  node  is 
trained  to  classify  patterns  channeled  to  it  by  a  previous  node.  A  stochastic  gradient 
descent  algorithm  is  presented  to  optimize  the  information  gain  (or  the  reduction 
of the impurity in the pattern sets) of the node. The PSIN scheme has been tested
on a few decision tree problems and shows results far superior to those of the three
layered network with hidden neurons and back propagation training.

We expect that for complex decision problems the combined parallel and sequential
strategy would significantly alleviate the scale-up problem of the all-parallel
multi-layered network. Since the scheme automatically searches for the best
organization of nodes and automatically learns the weights in each node to determine
the important attributes, it can be very useful in many real life problems.